AITopics | benchmark test

benchmark test

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

China beats U.S. with world's fastest supercomputer, but race not geared for AI work

The Japan Times | Jun-23-2026, 22:10:00 GMT

[Photo: Workers at Elon Musk's xAI facility, which houses a large supercomputer known as Colossus, used for Artificial Intelligence (AI) data processing, in Memphis, Tennessee, on Sept. 11, 2025 | REUTERS]

SAN FRANCISCO - China has overtaken the U.S. to win the top spot on a list of the world's fastest supercomputers, but the results may say more about Beijing's desire to show self-sufficiency in computing systems than its standing in the global AI race, experts said. The LineShine system at the National Supercomputing Center in Shenzhen, China, uses domestically designed chips and won the top spot on the TOP500, a biannual global ranking of supercomputers, with the country's first listing in three years. The ranking comes as the U.S. and China are increasingly competing in advanced computing, with U.S. President Donald Trump on Monday signing an executive order that aims to put the U.S. ahead of China in the emerging field of quantum computing. In the June 2026 edition of TOP500, LineShine beat out the previous titleholder, El Capitan, a supercomputer housed at Lawrence Livermore National Laboratory that the U.S. government uses to develop and maintain its nuclear weapons stockpile. But technology and policy experts said the results do not mean that China has the world's fastest computer for AI work because of changes in the computing industry in recent years and the methods used to compile the list.

artificial intelligence, scientific computing, social media, (13 more...)

Country:

Asia > Middle East > Iran (0.42)
North America > United States > Tennessee > Shelby County > Memphis (0.25)
North America > United States > California > San Francisco County > San Francisco (0.25)
(2 more...)

Industry:

Government > Regional Government > North America Government > United States Government (1.00)
Government > Military (1.00)

Technology:

Information Technology > Scientific Computing (1.00)
Information Technology > Artificial Intelligence (1.00)
Information Technology > Communications > Social Media (0.75)

Realistic Handwritten Multi-Digit Writer (MDW) Number Recognition Challenges

Wagstaff, Kiri L.

arXiv.org Artificial Intelligence | Dec-2-2025

Isolated digit classification has served as a motivating problem for decades of machine learning research. In real settings, numbers often occur as multiple digits, all written by the same person. Examples include ZIP Codes, handwritten check amounts, and appointment times. In this work, we leverage knowledge about the writers of NIST digit images to create more realistic benchmark multi-digit writer (MDW) data sets. As expected, we find that classifiers may perform well on isolated digits yet do poorly on multi-digit number recognition. If we want to solve real number recognition problems, additional advances are needed. The MDW benchmarks come with task-specific performance metrics that go beyond typical error calculations to more closely align with real-world impact. They also create opportunities to develop methods that can leverage task-specific knowledge to improve performance well beyond that of individual digit classification methods.
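A rough worked example of why strong isolated-digit accuracy can still give weak multi-digit results. It assumes, purely for illustration, that digit errors are independent and equally likely for every digit; that assumption, the 0.99 figure, and the name `number_accuracy` are not from the paper, and digits written by one writer may well have correlated errors.

```python
def number_accuracy(digit_accuracy: float, num_digits: int) -> float:
    """Chance that every digit in a number is read correctly.

    Assumes each digit is classified independently with the same accuracy,
    which is an illustrative simplification, not a claim from the paper.
    """
    if not 0.0 <= digit_accuracy <= 1.0:
        raise ValueError("digit_accuracy must be between 0 and 1")
    if num_digits < 1:
        raise ValueError("num_digits must be at least 1")
    return digit_accuracy ** num_digits


# Compare a single digit with a 5-digit and a 9-digit number.
for n in (1, 5, 9):
    print(f"{n} digit(s): {number_accuracy(0.99, n):.3f}")
```

Under this simplification the whole-number accuracy shrinks geometrically with length, which is one reason a per-digit error rate can understate real-world impact.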

artificial intelligence, digit, machine learning, (17 more...)

arXiv:2512.00676

Country: North America > United States (1.00)

Genre: Research Report (0.40)

Industry:

Government > Regional Government > North America Government > United States Government (0.94)
Education (0.70)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)

KunlunBaize: LLM with Multi-Scale Convolution and Multi-Token Prediction Under TransformerX Framework

Li, Cheng, Liu, Jiexiong, Chen, Yixuan, Jia, Yanqin, Li, Zhepeng

arXiv.org Artificial Intelligence | Mar-19-2025

Large language models have demonstrated remarkable performance across various tasks, yet they face challenges such as low computational efficiency, gradient vanishing, and difficulties in capturing complex feature interactions. To address these limitations, a novel framework has been proposed. This framework incorporates a learnable dense residual skip connection mechanism; a TransformerX module, a transformer-based component integrating multiscale convolution and adaptive activation functions; and a multitoken prediction interaction module. The learnable dense residual connections enhance information flow and feature capture across layers. Within the TransformerX module, large convolutional kernels aggregate semantic information from extensive text segments, while smaller convolutions focus on local word order and syntactic structures. The adaptive activation function dynamically adjusts its parameters based on the semantic features of the input text, improving the model's ability to handle diverse semantic expressions and complex relationships. The multitoken prediction module boosts data utilization and accelerates inference by predicting multiple future tokens. These components significantly enhance the performance and efficiency of large language models.

large language model, machine learning, natural language, (17 more...)

arXiv:2503.04784

Genre: Research Report (0.51)

Industry: Education (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Chatbots Are Cheating on Their Benchmark Tests

The Atlantic - Technology | Mar-5-2025, 18:57:06 GMT

Generative-AI companies have been selling a narrative of unprecedented, endless progress. Just last week, OpenAI introduced GPT-4.5 as its "largest and best model for chat yet." Earlier in February, Google called its latest version of Gemini "the world's best AI model." And in January, the Chinese company DeepSeek touted its R1 model as being just as powerful as OpenAI's o1 model--which Sam Altman had called "the smartest model in the world" the previous month. Yet there is growing evidence that progress is slowing down and that the LLM-powered chatbot may already be near its peak.

benchmark, large language model, machine learning, (20 more...)

Country: Asia > China > Fujian Province > Xiamen (0.05)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.79)

Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation

Eriksson, Maria, Purificato, Erasmo, Noroozian, Arman, Vinagre, Joao, Chaslot, Guillaume, Gomez, Emilia, Fernandez-Llorca, David

arXiv.org Artificial Intelligence | Feb-10-2025

Quantitative Artificial Intelligence (AI) Benchmarks have emerged as fundamental tools for evaluating the performance, capability, and safety of AI models and systems. Currently, they shape the direction of AI development and are playing an increasingly prominent role in regulatory frameworks. As their influence grows, however, so too do concerns about how and with what effects they evaluate highly sensitive topics such as capabilities, including high-impact capabilities, safety and systemic risks. This paper presents an interdisciplinary meta-review of about 100 studies that discuss shortcomings in quantitative benchmarking practices, published in the last 10 years. It brings together many fine-grained issues in the design and application of benchmarks (such as biases in dataset creation, inadequate documentation, data contamination, and failures to distinguish signal from noise) with broader sociotechnical issues (such as an over-focus on evaluating text-based AI models according to one-time testing logic that fails to account for how AI models are increasingly multimodal and interact with humans and other technical systems). Our review also highlights a series of systemic flaws in current benchmarking practices, such as misaligned incentives, construct validity issues, unknown unknowns, and problems with the gaming of benchmark results. Furthermore, it underscores how benchmark practices are fundamentally shaped by cultural, commercial and competitive dynamics that often prioritise state-of-the-art performance at the expense of broader societal concerns. By providing an overview of risks associated with existing benchmarking procedures, we problematise disproportionate trust placed in benchmarks and contribute to ongoing efforts to improve the accountability and relevance of quantitative AI benchmarks within the complexities of real-world scenarios.

benchmark, large language model, machine learning, (17 more...)

arXiv:2502.06559

Country:

North America > Canada > Ontario > Toronto (0.14)
Europe > Spain > Andalusia > Seville Province > Seville (0.05)
South America > Brazil > Rio de Janeiro > Rio de Janeiro (0.04)
(15 more...)

Genre:

Overview (1.00)
Research Report > New Finding (0.67)

Industry:

Law (1.00)
Health & Medicine (1.00)
Government > Regional Government > North America Government > United States Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Vision (0.93)
Information Technology > Artificial Intelligence > Issues > Social & Ethical Issues (0.67)

Training on the Benchmark Is Not All You Need

Ni, Shiwen, Kong, Xiangtao, Li, Chengming, Hu, Xiping, Xu, Ruifeng, Zhu, Jia, Yang, Min

arXiv.org Artificial Intelligence | Sep-3-2024

The success of Large Language Models (LLMs) relies heavily on the huge amount of pre-training data learned in the pre-training phase. The opacity of the pre-training process and the training data causes the results of many benchmark tests to become unreliable. If any model has been trained on a benchmark test set, it can seriously hinder the health of the field. In order to automate and efficiently test the capabilities of large language models, numerous mainstream benchmarks adopt a multiple-choice format. As the swapping of the contents of multiple-choice options does not affect the meaning of the question itself, we propose a simple and effective data leakage detection method based on this property. Specifically, we shuffle the contents of the options in the data to generate the corresponding derived data sets, and then detect data leakage based on the model's log probability distribution over the derived data sets. If there is a maximum and outlier in the set of log probabilities, it indicates that the data is leaked. Our method is able to work under black-box conditions without access to model training data or weights, effectively identifying data leakage from benchmark test sets in model pre-training data, including both normal scenarios and complex scenarios where options may have been shuffled intentionally or unintentionally. Through experiments based on two LLMs and benchmark designs, we demonstrate the effectiveness of our method. In addition, we evaluate the degree of data leakage of 31 mainstream open-source LLMs on four benchmark datasets and give a ranking of the leaked LLMs for each benchmark, and we find that the Qwen family of LLMs has the highest degree of data leakage.
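The option-shuffling idea in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `log_prob` callable stands in for whatever black-box scoring interface a model exposes, the prompt layout in `format_item` is invented, and the z-score threshold is an assumed outlier rule, since the abstract does not say how "outlier" is defined.

```python
from itertools import permutations
from statistics import mean, pstdev
from string import ascii_uppercase
from typing import Callable, Sequence


def format_item(question: str, options: Sequence[str]) -> str:
    """Render a multiple-choice item as text (hypothetical layout)."""
    lines = [question]
    lines += [f"{ascii_uppercase[i]}. {opt}" for i, opt in enumerate(options)]
    return "\n".join(lines)


def max_is_outlier(scores: Sequence[float], z_threshold: float = 2.0) -> bool:
    """Assumed rule: the top score lies far above the mean of all scores."""
    sigma = pstdev(scores)
    if sigma == 0:
        return False  # every ordering scored the same, so nothing stands out
    return (max(scores) - mean(scores)) / sigma > z_threshold


def looks_leaked(question: str, options: Sequence[str],
                 log_prob: Callable[[str], float],
                 z_threshold: float = 2.0) -> bool:
    """Score every ordering of the options and flag a standout maximum."""
    scores = [log_prob(format_item(question, perm))
              for perm in permutations(options)]
    return max_is_outlier(scores, z_threshold)


# Toy scorers standing in for a real model (purely illustrative).
QUESTION = "Which planet is closest to the Sun?"
OPTIONS = ["Venus", "Mercury", "Earth", "Mars"]
MEMORIZED = format_item(QUESTION, OPTIONS)


def memorizing_scorer(text: str) -> float:
    # Pretends the model saw exactly one ordering during training.
    return -10.0 if text == MEMORIZED else -50.0


def indifferent_scorer(text: str) -> float:
    # Depends only on length, which is the same for every ordering.
    return -float(len(text))


print(looks_leaked(QUESTION, OPTIONS, memorizing_scorer))
print(looks_leaked(QUESTION, OPTIONS, indifferent_scorer))
```

Because the flag fires when any one ordering stands out, not only the original one, the sketch is in the spirit of the abstract's claim that leakage can be detected even when options were shuffled before training.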

benchmark, data leakage, scenario, (15 more...)

arXiv:2409.0179

Country:

Asia > China > Guangdong Province > Shenzhen (0.04)
Asia > China > Heilongjiang Province > Harbin (0.04)

Genre: Research Report (0.40)

Industry: Education (0.69)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

In latest benchmark test of AI, it's mostly Nvidia competing against Nvidia

#artificialintelligence | Nov-28-2022, 13:06:18 GMT

For lack of rich competition, some of Nvidia's most significant results in the latest MLPerf were against itself, comparing its newest GPU, H100 "Hopper," to its existing product, the A100. Although chip giant Nvidia tends to cast a long shadow over the world of artificial intelligence, its ability to simply drive competition out of the market may be increasing, if the latest benchmark test results are any indication.

competition, nvidia, top score, (16 more...)

Country:

Europe > France > Auvergne-Rhône-Alpes > Isère > Grenoble (0.05)
Asia > China (0.05)

Industry: Information Technology > Hardware (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Nvidia's Impressive H100 MLPerf Benchmark

#artificialintelligence | Nov-11-2022, 19:48:17 GMT

In the complex world of AI/ML processing, it can be hard to compare products from various vendors due to the wide range of models and workloads in use. MLPerf is a consortium of major industry players and research organizations that provides agreed-upon benchmark tests to standardize results across vendor offerings, giving users a chance to evaluate competing performance claims. Nvidia has previously provided MLPerf test results for its A100 product. It has just released its MLPerf benchmarks for its new high-end device, the H100. It sports an impressive 6.7X performance gain over the older A100 devices in certain workloads, and is still being optimized with software that could eventually push the performance even higher.

mlperf benchmark, nvidia, test result, (8 more...)

Industry: Information Technology > Hardware (0.74)

Technology:

Information Technology > Communications > Social Media (0.40)
Information Technology > Artificial Intelligence (0.39)
Information Technology > Cloud Computing (0.33)

Statistical Tests for Comparing Classification Algorithms

#artificialintelligence | Nov-24-2021, 21:05:01 GMT

Comparing prediction methods to decide which one should be used for the task at hand is a daily activity for most data scientists. Usually, one will have a pool of classification models and will validate them using cross-validation to determine which one is best. Another goal, however, is not to compare classifiers, but the learning algorithms themselves. The idea is: given this task (data), which learning algorithm (KNN, SVM, Random Forests, etc.) will generate more accurate classifiers on a dataset of size D? As we will see, every method presented here has some pros and cons. However, the first intuition of using a two-proportions test can lead to some really bad results.
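To make the warning about the two-proportions test concrete, here is a small sketch contrasting it with McNemar's test, a commonly used paired alternative. Whether the article itself covers McNemar's test is not stated in this summary, and the function names and the example counts are invented for illustration. The two-proportions test treats the two accuracy figures as independent samples, whereas both classifiers are usually scored on the same test items.

```python
import math


def two_proportion_p_value(correct_a: int, correct_b: int, n: int) -> float:
    """Two-sided z-test that treats both accuracies as independent samples."""
    pooled = (correct_a + correct_b) / (2 * n)
    std_err = math.sqrt(pooled * (1 - pooled) * (2 / n))
    if std_err == 0:
        return 1.0
    z = (correct_a / n - correct_b / n) / std_err
    return math.erfc(abs(z) / math.sqrt(2))


def mcnemar_p_value(only_a_correct: int, only_b_correct: int) -> float:
    """McNemar's test with continuity correction on the discordant pairs.

    Uses the chi-square approximation with one degree of freedom, whose
    upper tail equals erfc(sqrt(statistic / 2)).
    """
    discordant = only_a_correct + only_b_correct
    if discordant == 0:
        return 1.0
    gap = max(abs(only_a_correct - only_b_correct) - 1, 0)
    statistic = gap ** 2 / discordant
    return math.erfc(math.sqrt(statistic / 2))


# Invented example: 1000 shared test items. Both models are right on 875,
# only A is right on 25, only B is right on 5, and both are wrong on 95.
n_items = 1000
both_correct, only_a, only_b = 875, 25, 5

print(two_proportion_p_value(both_correct + only_a, both_correct + only_b, n_items))
print(mcnemar_p_value(only_a, only_b))
```

In this made-up case the two tests disagree at the usual 0.05 level, because only the paired test uses the fact that the models disagree on just 30 items.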

algorithm, implementation, statistical test, (15 more...)

Country: North America > United States > California > Orange County > Irvine (0.05)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis (0.56)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.36)

Facebook: Here comes the AI of the Metaverse

#artificialintelligence | Oct-15-2021, 02:16:11 GMT

To operate in augmented and virtual reality, Facebook believes artificial intelligence will need to develop an "egocentric perspective." To that end, the company on Thursday announced Ego4D, a data set of 2,792 hours of first-person video, and a set of benchmark tests for neural nets, designed to encourage the development of AI that is savvier about what it's like to move through virtual worlds from a first-person perspective. The project is a collaboration between Facebook Reality Labs and scholars from 13 research institutions, including academic institutions and research labs. The details are laid out in a paper lead-authored by Facebook's Kristen Grauman, "Ego4D: Around the World in 2.8K Hours of Egocentric Video." Grauman is a scientist with the company's Facebook AI Research unit.

facebook, neural net, video, (14 more...)

Industry: Information Technology > Services (0.52)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Human Computer Interaction > Interfaces > Virtual Reality (0.51)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.42)
